The `pgvectorscale` extension adds diskANN index support for pgvector.
This extension is useful in cases where `pgvector`'s `hnsw` index does not fit into available memory and as a result the ANN search does not perform as expected.

## Key Features

- StreamingDiskANN index - disk-backed HNSW variant.
- Statistical Binary Quantization (SBQ)
- Label-based filtering combined with DiskANN index.

## Example: DiskANN index on shared table

To keep the example readable we'll work with **3-dimensional vectors**.
Swap `VECTOR(3)` for `VECTOR(768)` or `VECTOR(1536)` in real apps.

```sql
-- 1. Shared data table
CREATE TABLE document_embedding (
  id        BIGSERIAL PRIMARY KEY,
  contents  TEXT,
  metadata  JSONB,
  embedding VECTOR(3)
);

-- 2. Seed with tiny sample data
INSERT INTO document_embedding (contents, metadata, embedding) VALUES
  ('T-shirt',        '{"category":"apparel"}', '[0.10, 0.20, 0.30]'),
  ('Sweater',        '{"category":"apparel"}', '[0.12, 0.18, 0.33]'),
  ('Coffee mug',     '{"category":"kitchen"}', '[0.90, 0.80, 0.70]');

-- 3. Build a DiskANN index (cosine distance)
CREATE INDEX document_embedding_diskann_idx
  ON document_embedding
  USING diskann (embedding vector_cosine_ops);

-- 4. k-NN query (top-2 similar items)
SELECT id, contents, metadata
FROM   document_embedding
ORDER  BY embedding <=> '[0.11, 0.21, 0.29]'  -- query vector
LIMIT  2;
```

You should see the two apparel rows first - a good sanity check that the index works.

## Example: DiskANN index on tenant-aware table

```sql
-- 1. Tenant-aware table
CREATE TABLE tenant_embedding (
  tenant_id UUID        NOT NULL,
  doc_id    BIGINT,
  embedding VECTOR(2),  -- using tiny 2‑dim vectors for demo
  metadata  JSONB,
  PRIMARY KEY (tenant_id, doc_id)
);

-- 2. Create some tenants
INSERT INTO tenants (id, name) VALUES
  ('11111111-1111-1111-1111-111111111111', 'Tenant A');
INSERT INTO tenants (id, name) VALUES
  ('22222222-2222-2222-2222-222222222222', 'Tenant B');

-- 3. Seed soome data
INSERT INTO tenant_embedding (tenant_id, doc_id, embedding, metadata) VALUES
  ('11111111-1111-1111-1111-111111111111', 1, '[0.05, 0.95]', '{"title":"DocA"}'),
  ('11111111-1111-1111-1111-111111111111', 2, '[0.04, 0.90]', '{"title":"DocB"}');
INSERT INTO tenant_embedding (tenant_id, doc_id, embedding, metadata) VALUES
  ('22222222-2222-2222-2222-222222222222', 1, '[0.80, 0.20]', '{"title":"DocC"}');

-- 3. Create an index (Nile will partition by tenant_id)
CREATE INDEX tenant_embedding_diskann_idx
  ON tenant_embedding
  USING diskann (embedding vector_cosine_ops);

-- 4. Tenant‑scoped ANN query
SET nile.tenant_id = '11111111-1111-1111-1111-111111111111';
SELECT doc_id, metadata
FROM   tenant_embedding
ORDER  BY embedding <=> '[0.06, 0.92]'
LIMIT  2;
```

## Example: Label-based filtering

Label-based filtering is a technique that allows you to filter the results of an ANN search based on a label while using the DiskANN index.
Other filters are supported, but will use pgvector's post-filtering (i.e. after the ANN search).

In order to use label based filtering, you need to:

- Create a label column in your table. It has to be an array of `smallint`s. Other types will revert to using the post-filtering.
- Create a diskann index that uses both the embedding and the label column.
- Use the `&&` (array intersection) operator in search queries.
- Optional, but recommended: Use a separate table and joins to translate smallint labels to meaningful descriptions.

```sql
-- 1. Create a label column
CREATE TABLE documents (
  id        BIGSERIAL PRIMARY KEY,
  embedding VECTOR(3),
  labels    SMALLINT[]
);

-- 2. Create an index on the label column
-- Insert a couple of demo rows
INSERT INTO documents (embedding, labels) VALUES
  ('[0.3,0.2,0.1]', ARRAY[1]),        -- label 1 = science
  ('[0.35,0.25,0.05]', ARRAY[1,2]),   -- label 2 = business
  ('[0.9,0.8,0.7]', ARRAY[3]);        -- label 3 = art

-- 3. Create an index on the label column
CREATE INDEX documents_ann_idx
  ON documents
  USING diskann (embedding vector_cosine_ops, labels);

-- 4. Query with label-based filtering
SELECT *
FROM   documents
WHERE  labels && ARRAY[1,2]
ORDER  BY embedding <=> '[0.32,0.18,0.12]'
LIMIT  5;


-- 5. Optional: Translate labels to descriptions
CREATE TABLE labels (
  id        SMALLINT PRIMARY KEY,
  description TEXT
);

INSERT INTO labels (id, description) VALUES
  (1, 'Science'),
  (2, 'Business'),
  (3, 'Art');

-- 6. Query with label-based filtering and description
SELECT d.*
FROM documents d
WHERE d.labels && (
    SELECT array_agg(id)
    FROM labels
    WHERE description in ('Science', 'Business')
)
ORDER BY d.embedding <=> '[0.32,0.18,0.12]'
LIMIT 5;
```

## Limitations

- DiskANN index supports `cosine`, `l2` and `inner_product` distance metrics, not the entire pgvector's set of distance metrics.
- Label-based filtering is only supported for `smallint` arrays and the `&&` operator. Other types will revert to using the post-filtering.
- DiskANN is best suited for datasets where `hnsw` index would be too large to fit into memory. For smaller datasets, `hnsw` is still a good choice.

## Additional Resources

[Pgvectorscale github repository](https://github.com/timescale/pgvectorscale)
